Group Testing with Random Pools: 
optimal two-stage algorithms 

Marc Mezard, Cristina Toninelli 

Abstract: We study Probabilistic Group Testing of a set of N items each of which is defective with probability p. We focus on the double limit of small defect probability, $p \ll 1$, and large number of variables, $N \gg 1$, taking either $p \to 0$ after $N \to \infty$ or $p = 1/N^{\beta}$ with $\beta \in (0, 1/2)$. In both settings the optimal number of tests which are required to identify with certainty the defectives via a two-stage procedure, $T(N,p)$, is known to scale as $Np|\log p|$. Here we determine the sharp asymptotic value of $T(N,p)/(Np|\log p|)$ and construct a class of two-stage algorithms over which this optimal value is attained. This is done by choosing a proper bipartite regular graph (of tests and variable nodes) for the first stage of the detection. Furthermore we prove that this optimal value is also attained on average over a random bipartite graph where all variables have the same degree, while the tests have Poisson-distributed degrees. Finally, we improve the existing upper and lower bounds for the optimal number of tests in the case $p = 1/N^{\beta}$ with $\beta \in [1/2, 1)$.

Index Terms: Group testing, reconstruction algorithms




1 Introduction 

The aim of Group Testing is to detect an un- 
known subset of defective (also referred to as 
positive or active) items out of a set of objects 
by means of queries (the tests) in the most 
efficient way. In other words we are given a 
set of objects, O, which contains an unknown 
subset of defectives, V, and the task is to 
identify V by means of the fewest possible 
number of tests. Tests are queries of the form 
"Does the pool Q (where Q is a subset of O) 
contain at least one positive item?". This prob- 
lem was originally introduced in relation with 
efficient mass blood testing [1]. It has since been applied in a variety of situations in molecular biology: blood screening for HIV tests [2], screening of clone libraries [3], [4], and sequencing by hybridization [5], [6]. Furthermore it has proved relevant for fields other than biology, including quality control in product testing [7], searching files in storage systems [8], data compression [9] and, more recently, in the context of data gathering in sensor
networks [10]. We refer to [11], [12] for reviews 
on the different applications of GT. Here we 
will deal with the very much studied gold- 
standard case, namely the idealized situation 
in which tests are perfect: there can be neither 
false positives nor false negatives in the test 
answers. For future work, however, it is important to keep in mind that in many biological applications one should include the possibility of errors in the test answers.

Before presenting our results we recall some 
standard classifications of GT problems. First of 
all a GT problem can be either Combinatorial 
or Probabilistic. Combinatorial GT refers to the 
situation in which V can be any member of 
a predetermined class of sets in O. The task 
is here to find the algorithm which requires 
the minimal number of tests to determine V 
in the worst case. In probabilistic GT we are given a configuration space S and a probability distribution $\mu$ on S, and the set of objects O (and therefore the corresponding V) is chosen in S according to $\mu$. In this case the task is to optimize the expected (with respect to $\mu$) number of tests required to determine V. Furthermore in both combinatorial and probabilistic GT there
is an additional classification which concerns 
the number of stages, i.e. parallel queries, in 
the detection procedure. For one-stage (or fully 
non-adaptive) algorithms all tests are speci- 
fied in advance: the choice of the pools {Q} 
does not depend on the outcome of the tests 
(and therefore does not depend on O). For 
several biological applications a non-adaptive 
procedure would in principle be the best one. 
Indeed the test procedure can be destructive 
for the objects and repeated tests on the same 
sample require more sophisticated techniques. 
However the number of tests required by fully 
non-adaptive algorithms can be much larger 
than for adaptive ones. The best compromise 
for most screening procedures [13] is therefore 
to consider two-stage algorithms with a first 
stage containing a set of predetermined pools 
(tested in parallel) and a second stage whose 
pools are chosen depending on the outcomes 
of the first stage (and therefore on the choice of 
O). For Probabilistic GT the only possibility to detect all defectives with such a procedure is to choose a trivial two-stage algorithm [13], which individually tests in the second stage all the variables which are left undetermined by the first stage. Here we will consider Probabilistic Group Testing when $\mu$ is the Bernoulli product measure and we will analyze the performance of two-stage algorithms as a function of the overall number of objects, N, and of the probability that a chosen object is defective, p. In particular we will analyze the relevant limit of small p and large N, which has already been investigated in [13]-[16]. A detailed account of our new contributions follows.



2 Notation and results 

We consider Probabilistic Group Testing in the Bernoulli p-scheme: the configuration space is $S = \{0,1\}^N$, namely the set of all vectors $X = (x_1, \dots, x_N)$ with $x_i \in \{0,1\}$, and the probability measure is the Bernoulli product measure $\mu_p$ with marginal $\mu_p(x_i = 1) = p$, namely $\mu_p(X) = \prod_{i=1}^{N} p^{x_i}(1-p)^{1-x_i}$. For a given choice of X we say that variable i is (is not) defective or positive if $x_i = 1$ ($x_i = 0$).



A test of the type "Does pool Q contain at least one defective?" corresponds here to asking whether the value of the random variable constructed as an OR function among the variables of the pool equals one or zero. More precisely we will call "pool a" an N-component binary vector $P_a = (c_{1,a}, c_{2,a}, \dots, c_{N,a})$ with $c_{i,a} \in \{0,1\}$, and we will say that variable i belongs (does not belong) to pool a if $c_{i,a} = 1$ ($c_{i,a} = 0$). With this notation we will call "test a" the random variable $T_a \in \{0,1\}$ with $T_a = 0$ if $c_{i,a} x_i = 0$ for all $i \in (1, \dots, N)$, and $T_a = 1$ otherwise. In other words $T_a$ is the OR function among the variables that belong to pool a.

For a given choice of the variables, X, and a set of M pools, $\{P_a\}$, $a = 1, \dots, M$, we say that: (a) variable i is a sure zero if there exists at least one $a \in (1, \dots, M)$ such that i belongs to pool a and $T_a = 0$; (b) variable i is a sure one if there exists at least one $a \in (1, \dots, M)$ such that i belongs to pool a, $T_a = 1$ and all the other variables j, $j \neq i$, which belong to pool a are sure zeros.

It is obvious that if i is a sure zero then $x_i = 0$, and if i is a sure one then $x_i = 1$. Note however that the converse is not true: there can be a zero (one) variable that is not a sure zero (sure one, respectively). Indeed, any given choice of X, M and $\{P_a\}$, $a \in (1, \dots, M)$, identifies the following subsets of the set of variables $\{1, \dots, N\}$.

(i) The zeros and the ones

$$Z := \{ i : x_i = 0 \}, \qquad V := \{ i : x_i = 1 \} = \{1, \dots, N\} \setminus Z;$$

(ii) The sure zeros and the sure ones

$$Z \supseteq S_0 := \Big\{ i : \prod_{a=1}^{M} (T_a)^{c_{i,a}} = 0 \Big\},$$

$$V \supseteq S_1 := \Big\{ i : \sum_{a=1}^{M} c_{i,a}\, x_i \prod_{j \neq i} \big(1_{S_0}(j)\big)^{c_{j,a}} > 0 \Big\}$$

(here and throughout this paper we define $0^0 = 1$, and $1_A$ stands for the characteristic function of set A);

(iii) The undetermined zeros and the undetermined ones

$$U_0 := Z \setminus S_0, \qquad U_1 := V \setminus S_1.$$






The two-stage algorithms that we consider are composed of a first stage of parallel tests and a second stage of individual tests over the variables whose value has been left undetermined by the first stage. Therefore the choice of the algorithm is completely defined by fixing the number of tests in the first stage, $M \in \mathbb{N}$, and by choosing the pools $\{P_a\}$, $a = 1, \dots, M$. The latter corresponds to fixing an $N \times M$ matrix $C_{N,M}$ with binary entries $c_{i,a} \in \{0,1\}$ which give the i-th component of vector $P_a$. This will be called the connectivity matrix. In other words, the choice of the algorithm corresponds to fixing a couple $(M, C_{N,M})$, namely choosing a bipartite graph $\mathcal{G} = \mathcal{G}(C_{N,M})$ with N variable nodes and M test nodes. The number of tests required to identify the defectives (i.e. to decode the value of X), $T(X, M, C_{N,M})$, is therefore given by the number of tests in the first stage plus the number of variables which are left undetermined by them, namely

$$T(X, M, C_{N,M}) := M + |U_0| + |U_1|. \tag{1}$$
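The decoding rule behind the test count in (1) can be illustrated with a minimal Python sketch. The function name `two_stage_test_count` and the list-of-lists encoding of the connectivity matrix are assumptions made for this illustration only; the sketch applies the definitions of sure zeros and sure ones literally and then charges one individual test per undetermined variable.

```python
def two_stage_test_count(x, C):
    """Return (total tests, sure zeros, sure ones) for defect vector x and pools C.

    x is a list of N values in {0, 1}; C is an N x M list of lists with
    C[i][a] = 1 when variable i belongs to pool a (hypothetical encoding).
    """
    N, M = len(C), len(C[0])
    pools = [[i for i in range(N) if C[i][a]] for a in range(M)]
    # First stage: each test is the OR of the variables in its pool.
    outcome = [int(any(x[i] for i in pool)) for pool in pools]
    # Sure zeros: variables that belong to at least one negative pool.
    sure_zero = [any(C[i][a] and outcome[a] == 0 for a in range(M)) for i in range(N)]
    # Sure ones: variables in a positive pool whose other members are all sure zeros.
    sure_one = [
        any(
            C[i][a] and outcome[a] == 1
            and all(sure_zero[j] for j in pools[a] if j != i)
            for a in range(M)
        )
        for i in range(N)
    ]
    # Second stage: one individual test per undetermined variable.
    undetermined = sum(1 for i in range(N) if not sure_zero[i] and not sure_one[i])
    return M + undetermined, sure_zero, sure_one
```

For instance, with four variables, pools {0, 1} and {1, 2}, and only variable 0 defective, the first pool is positive and the second negative, so variables 1 and 2 are sure zeros, variable 0 is a sure one, and variable 3 (which belongs to no pool) needs an individual test.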



Note that $U_0$ and $U_1$ depend in general on X, M and $C_{N,M}$. We will denote by $T_{M,C_{N,M},p}$ the mean of $T(X, M, C_{N,M})$ over the Bernoulli distribution $\mu_p$ for X, namely

$$T_{M,C_{N,M},p} := M + \sum_{X \in S} \mu_p(X)\,\big(|U_0| + |U_1|\big). \tag{2}$$

In this probabilistic setting the first important issue is to determine the optimal value $T(N,p)$ of $T_{M,C_{N,M},p}$ over all two-stage algorithms, i.e. over all choices of M and $C_{N,M}$:

$$T(N,p) := \min_{M,\, C_{N,M}} T_{M,C_{N,M},p}, \tag{3}$$

where the minimization is restricted to $M \in (1, \dots, N)$ (it is obvious that the optimal value can never be attained at $M \ge N + 1$).

Here we will study this problem in the relevant limit of small defective probability, $p \ll 1$, which has already been investigated in [13]-[16]. We will denote by $\lim_{N\to\infty|\beta}$ the limit where N goes to $\infty$ and p goes to zero, with $p = N^{-\beta}$ and $\beta > 0$, i.e.

$$\lim_{N\to\infty|\beta} f(N,p) := \lim_{N\to\infty} f(N, N^{-\beta}). \tag{4}$$

We will also study the limit $\lim_{p\to 0}\lim_{N\to\infty}$ and, in order to lighten the presentation of our results, we will refer to this case as the $\beta = 0$ case:

$$\lim_{N\to\infty|0} f(N,p) := \lim_{p\to 0}\,\lim_{N\to\infty} f(N,p). \tag{5}$$



Our main contributions are the following results for the asymptotics of $T(N,p)$, which will be proved in Sections 4 and 6 respectively.

Theorem 1: When $\beta \in [0, 1/2)$,

$$\lim_{N\to\infty|\beta} \frac{T(N,p)}{Np|\log p|} = \frac{1}{(\log 2)^2}. \tag{6}$$

Theorem 2: When $\beta \ge 1/2$,

$$\frac{1}{(\log 2)^2} \le \lim_{N\to\infty|\beta} \frac{T(N,p)}{Np|\log p|} \le e. \tag{7}$$

To our knowledge the best previously known bounds for $0 < \beta < 1$ were

$$\frac{1}{\log 2} \le \lim_{N\to\infty|\beta} \frac{T(N,p)}{Np|\log p|} \le \frac{4}{\beta}, \tag{8}$$

which were obtained in [14]: the lower bound via the information theoretic bound and the upper bound by the explicit construction of a decoding algorithm based on a random choice of the pools.

Our results determine the sharp asymptotics of $T(N,p)/(Np|\log p|)$ for the cases $p = N^{-\beta}$ with $\beta \in (0, 1/2)$ and for the case $p \to 0$ after $N \to \infty$. Furthermore, they sharpen the previously existing bounds for $p = N^{-\beta}$ with $1/2 \le \beta < 1$.

A second relevant issue is the explicit construction of an asymptotically optimal algorithm, namely the identification of a family of couples $(M, C_{N,M})$ such that

$$\lim_{N\to\infty|\beta} \frac{T_{M,C_{N,M},p}}{Np|\log p|} = \lim_{N\to\infty|\beta} \frac{T(N,p)}{Np|\log p|}. \tag{9}$$

Here, for each $\beta$ with $0 \le \beta < 1/2$, we identify a ($\beta$-dependent) family of couples $(M, C_{N,M})$ which satisfies (9). Let

$$\tilde{L} := \big[\,|\log p|/\log 2\,\big], \tag{10}$$

$$\tilde{M} := \big[\,Np|\log p|/(\log 2)^2\,\big], \tag{11}$$

where [x] stands for the integer part of x. We construct a pooling design based on a "regular-regular" bipartite graph with N variable nodes, $\tilde{M}$ test nodes, $\tilde{L}$ tests per variable and $\tilde{K} = N\tilde{L}/\tilde{M}$ variables per test, and with girth (i.e. length of the shortest graph cycle) at least 6. This means that the corresponding connectivity matrix satisfies the following constraints

$$\forall a \in (1, \dots, \tilde{M}): \quad \sum_{i=1}^{N} c_{i,a} = \tilde{K}, \tag{12}$$

$$\forall i \in (1, \dots, N): \quad \sum_{b=1}^{\tilde{M}} c_{i,b} = \tilde{L}, \tag{13}$$

$$\sum_{1 \le j < l \le N}\ \sum_{1 \le d < b \le \tilde{M}} c_{j,b}\, c_{j,d}\, c_{l,b}\, c_{l,d} = 0. \tag{14}$$
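As an illustration of constraints (12)-(14), the following Python sketch checks whether a given 0/1 matrix is regular on both sides and free of cycles of length 4 (for a bipartite graph this is the same as girth at least 6). The function name `check_pooling_design` and the list-of-lists encoding are assumptions made for this example, and the quadratic loop over pairs of variables is only meant for small instances.

```python
def check_pooling_design(C, L, K):
    """Check a connectivity matrix C (N x M list of lists of 0/1) against (12)-(14).

    Returns three booleans: every test has K variables, every variable has
    L tests, and no two distinct variables share more than one test.
    """
    N, M = len(C), len(C[0])
    tests_regular = all(sum(C[i][a] for i in range(N)) == K for a in range(M))   # (12)
    variables_regular = all(sum(C[i]) == L for i in range(N))                    # (13)
    # (14): two variables sharing two tests would close a cycle of length 4.
    no_four_cycles = all(
        sum(C[j][a] * C[l][a] for a in range(M)) <= 1
        for j in range(N) for l in range(j + 1, N)
    )
    return tests_regular, variables_regular, no_four_cycles
```

A cycle with three variables and three tests of degree 2 passes all three checks, whereas two variables that both belong to the same two tests violate the last one.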

In Section 4 we will prove that any such connectivity matrix is asymptotically optimal, namely

Theorem 3: Let $\tilde{C}_{N,\tilde{M}}$ be such that conditions (12), (13) and (14) are satisfied. If $0 \le \beta < 1/2$, then:

$$\lim_{N\to\infty|\beta} \frac{T_{\tilde{M},\tilde{C}_{N,\tilde{M}},p}}{Np|\log p|} = \frac{1}{(\log 2)^2}. \tag{15}$$

Notice that the family of graphs satisfying the requested properties is non-empty under the conditions for $\beta$ stated in the theorem, thanks to a constructive procedure found in [17], as we shall discuss in the proof.

Furthermore we have proved that the optimal value is also attained asymptotically by some random pool designs whose construction is much simpler than the one of [17]. Let $P^{R-P}_{N,M,L}$ denote the distribution of bipartite "regular-Poisson" graphs with N variable nodes, M test nodes, and a fixed number of tests, L, randomly connected to each variable node. Explicitly:

$$P^{R-P}_{N,M,L}(C_{N,M}) := \prod_{i=1}^{N} P(c_{i,1}, \dots, c_{i,M}) \tag{16}$$

with

$$P(c_{i,1}, \dots, c_{i,M}) := \begin{cases} \binom{M}{L}^{-1} & \text{if } \sum_{a=1}^{M} c_{i,a} = L, \\ 0 & \text{otherwise.} \end{cases} \tag{17}$$
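Sampling from the regular-Poisson ensemble is straightforward, which is the practical appeal of this design. The sketch below assumes, as in (16)-(17), that each variable independently picks L distinct tests uniformly at random; the function name `sample_regular_poisson` and the `seed` parameter are hypothetical names introduced for this illustration.

```python
import random

def sample_regular_poisson(N, M, L, seed=None):
    """Draw an N x M connectivity matrix with exactly L tests per variable.

    Each variable chooses its L tests uniformly among the M available ones,
    independently of the other variables, so test degrees fluctuate.
    """
    rng = random.Random(seed)
    C = [[0] * M for _ in range(N)]
    for i in range(N):
        for a in rng.sample(range(M), L):
            C[i][a] = 1
    return C
```

Every row of the sampled matrix sums to L, while the column sums (the test degrees) are random with mean NL/M.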



Note that, when one takes the large N limit with $L \ll N$ and $L \ll M$, the degrees of the tests become iid random variables with a Poisson distribution of mean $K = NL/M$. If we make the choice $L = \tilde{L}$ and $M = \tilde{M}$ as in (10) and (11), the following result, whose proof is provided in Section 5, holds.

Theorem 4: When $0 \le \beta < 1/2$,

$$\lim_{N\to\infty|\beta} \sum_{C_{N,\tilde{M}}} P^{R-P}_{N,\tilde{M},\tilde{L}}(C_{N,\tilde{M}})\, \frac{T_{\tilde{M},C_{N,\tilde{M}},p}}{Np|\log p|} = \frac{1}{(\log 2)^2}. \tag{18}$$

Finally, we provide a random class of connectivity matrices for which our upper bound in (7) for the case $p = 1/N^{\beta}$ with $1/2 \le \beta < 1$ is attained. Let $P^{P-P}_{N,M,L}$ denote the distribution of random bipartite graphs with N variable nodes, M test nodes and l tests per variable (k variables per test), with l Poisson distributed with mean L (k Poisson distributed with mean $K = NL/M$), namely

$$P^{P-P}_{N,M,L}(C_{N,M}) := \prod_{a=1}^{M} \prod_{i=1}^{N} \left[ \frac{L}{M}\, \delta_{c_{i,a},1} + \left(1 - \frac{L}{M}\right) \delta_{c_{i,a},0} \right]. \tag{19}$$

If we make the choice $L = \hat{L}$ and $M = \hat{M}$ with

$$\hat{L} := \big[\, e|\log p| \,\big], \tag{20}$$

$$\hat{M} := \big[\, eNp|\log p| \,\big], \tag{21}$$

the following result, whose proof is given in Section 6, holds.

Theorem 5: When $0 \le \beta < 1$,

$$\lim_{N\to\infty|\beta} \sum_{C_{N,\hat{M}}} P^{P-P}_{N,\hat{M},\hat{L}}(C_{N,\hat{M}})\, \frac{T_{\hat{M},C_{N,\hat{M}},p}}{Np|\log p|} = e. \tag{22}$$



Furthermore, the choice (20)-(21) for the couple (M, L) is optimal over all the Poisson-Poisson distributions, namely

Remark 1: When $0 \le \beta < 1$, for any M, L

$$\lim_{N\to\infty|\beta} \sum_{C_{N,M}} P^{P-P}_{N,M,L}(C_{N,M})\, \frac{T_{M,C_{N,M},p}}{Np|\log p|} \ge e. \tag{23}$$

Note that the Poisson-Poisson distribution had already been used in [14] to obtain the upper bound on $T(N,p)$ which we have recalled in formula (8). Here, by optimizing the choice of the parameters L and M for $P^{P-P}_{N,M,L}$, we improve the upper bound (8), which was obtained in [14] by choosing $M = 4Np|\log p|$ and $L = 2|\log p|$.

Furthermore the result of Remark 1 together with Theorem 1 implies that the optimal value for $\beta \in [0, 1/2)$ can never be attained on the class of Poisson-Poisson distributions, while it can be attained both on the class of regular-Poisson distributions and on the class of regular-regular graphs with girth at least 6 (see Theorem 3).

The outline of the paper is the following.

In Section 3 we establish a lower bound on $T_{M,C_{N,M},p}$ (Theorem 6) which holds for any N and $p \to 0$. In Section 4 we prove Theorem 3 which, together with Theorem 6, completes the proof of Theorem 1 and identifies a set of algorithms for which the asymptotic value of $T_{M,C_{N,M},p}$ is attained if $\beta \in [0, 1/2)$. In Section 5 we prove Theorem 4, which identifies a different class of random algorithms over which this asymptotic value is also attained. In Section 6 we prove Theorem 5 which, together with Theorem 6, completes the proof of Theorem 2 and identifies an algorithm over which our upper bound on $T_{M,C_{N,M},p}$ is attained when $1/2 \le \beta < 1$.



3 Lower bound on T(N,p)

In this section we prove the following lower bound on $T(N,p)$, which holds for any N whenever we let $p \to 0$.

Theorem 6:

$$\lim_{p\to 0} \frac{T(N,p)}{Np|\log p|} \ge \frac{1}{(\log 2)^2}. \tag{24}$$

When one takes the limit $\lim_{N\to\infty|\beta}$ with $\beta \in [0, 1)$, Theorem 6 improves the previously existing lower bounds [14] on $T(N,p)$. Furthermore for all the cases $\beta \in [0, 1/2)$ it allows, together with Theorem 3, to determine the exact value of $\lim_{N\to\infty|\beta} T(N,p)/(Np|\log p|)$. On the other hand when $\beta \ge 1$ better bounds than the one given by our (24) already exist [13], [14].

Let $q := (1-p)$ and $\mathbb{Z}_N^+ := \{0, 1, \dots, N\}$. We define $B_{M,C_{N,M},p}$ and the function $A_p : (\mathbb{Z}_N^+)^N \to \mathbb{R}$ as

$$B_{M,C_{N,M},p} := q \sum_{i=1}^{N} \prod_{a=1}^{M} \left(1 - q^{d_a - 1}\right)^{c_{i,a}}, \tag{25}$$

$$A_p(\vec{m}) := \sum_{i=1}^{N} \frac{m_i}{i} + q\, 0^{m_1} \exp\Big(-\sum_{i=2}^{N} m_i a_i^p\Big), \tag{26}$$

where $d_a$ is the degree of test a, i.e.

$$d_a := \sum_{i=1}^{N} c_{i,a}, \tag{27}$$

and, for $i = 2, \dots, N$,

$$a_i^p := \big|\log\big(1 - (1-p)^{i-1}\big)\big|.$$

We also let

$$A_p := \min_{\vec{m} \in (\mathbb{Z}_N^+)^N} A_p(\vec{m}), \tag{28}$$

$$U(p) := \min_{r \in [2,\infty)} \frac{1}{r\, a_r^p}, \tag{29}$$

$$c(p) := \min_{w \in [0,\infty)} \left( w\, U(p) + (1-p)\, e^{-w} \right). \tag{30}$$

In order to prove Theorem 6 we will use the following results, whose proofs are postponed to the next sections.

Lemma 1: (a) For any choice of M and $C_{N,M}$ the expected number of undetected zeros, $U_0$, is lower bounded by

$$\sum_{X} \mu_p(X)\, |U_0| \ge B_{M,C_{N,M},p}. \tag{31}$$

(b) If the girth of the graph $\mathcal{G}(C_{N,M})$ is larger than or equal to 6, (31) holds as an equality.

Lemma 2:

$$\min_{M,\, C_{N,M}} \left( M + B_{M,C_{N,M},p} \right) \ge N A_p.$$

Lemma 3:

$$A_p \ge \min(1, c(p)). \tag{32}$$

Claim 1: When $p \to 0$, $U(p) = \dfrac{p}{(\log 2)^2} + O(p^2)$.

Proof of Theorem 6: By using definition (2), Lemma 1 and the trivial inequality $\sum_X \mu_p(X)|U_1| \ge 0$, we get

$$T_{M,C_{N,M},p} \ge M + B_{M,C_{N,M},p}. \tag{33}$$

Notice that for all the choices of p that we consider, the bound will not suffer from the fact that we neglected the contribution from undetermined ones. This can be seen from the facts

$$\sum_X \mu_p(X)\, |U_1| \le \sum_X \mu_p(X)\, |V| = pN$$

and $T_{M,C_{N,M},p} \ge -Np\log_2 p$ (this is the information theoretic lower bound [14]), which imply that for any $\beta \in [0, 1)$:

$$\lim_{N\to\infty|\beta} \frac{\sum_X \mu_p(X)\, |U_1|}{T_{M,C_{N,M},p}} = 0.$$

Since (33) holds for any $(M, C_{N,M})$, by using Lemmas 2 and 3 it follows immediately that

$$T(N,p) \ge N \min(1, c(p)). \tag{34}$$

From definition (30), as an immediate corollary of Claim 1, we get

$$c(p) = \frac{p|\log p|}{(\log 2)^2} + \big(1 - 2|\log(\log 2)|\big) \frac{p}{(\log 2)^2} + o(p) \tag{35}$$

in the limit $p \to 0$. By gathering the results (34) and (35) the proof of Theorem 6 is concluded. Furthermore, in the same limit we get the following lower bound for the corrections

$$\frac{1 - 2|\log(\log 2)|}{(\log 2)^2} \le \frac{T(N,p) - Np|\log p|(\log 2)^{-2}}{Np}. \tag{36}$$

□

3.1 Proof of Lemma 1



(a) By definition the set of undetected zeros, $U_0$, contains all the variables i such that $x_i = 0$ and $T_a = 1$ for any a such that $c_{i,a} = 1$, i.e. i belongs only to pools containing at least one variable equal to one. Therefore:

$$\sum_X \mu_p(X)\, |U_0| = \sum_{i=1}^{N} \sum_X \mu_p(X)\, (1 - x_i) \prod_{a=1}^{M} W_{i,a}(X) \tag{37}$$

where

$$W_{i,a}(X) := \Big( 1 - \prod_{j = 1, \dots, N;\ j \neq i} (1 - x_j)^{c_{j,a}} \Big)^{c_{i,a}}. \tag{38}$$

Since $W_{i,a}(X)$ does not depend on $x_i$ we can immediately perform the average over this variable for each term of the sum in (37). Then, for each given i, we introduce the partial order $\prec_i$ according to which $X \prec_i X'$ if and only if $x_j \le x'_j$ for all $j \in \{(1, \dots, N) \setminus i\}$. For any $C_{N,M}$ and for any $a \in (1, \dots, M)$, $W_{i,a}$ is a non-decreasing function with respect to this partial order, namely $X \prec_i X'$ implies that $W_{i,a}(X) \le W_{i,a}(X')$. Therefore inequality (31) follows by applying the FKG inequality [18] to each term of the sum in (37). In other words, we have simply used the positive correlation among the events that there exists at least one variable equal to one in two (or more) intersecting pools.

(b) If the bipartite graph has girth at least 6, i.e. if M, $C_{N,M}$ are such that for any couple of variables there exists at most one test which contains both of them (see condition (14)), the events defined above are independent. Therefore (31) holds with the equality sign. □
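Lemma 1 can be checked by brute force on very small designs. The sketch below is only an illustration under stated assumptions: the function names `expected_undetected_zeros` and `lower_bound_B` are hypothetical, the lower bound is taken in the product form $q \sum_i \prod_a (1 - q^{d_a - 1})^{c_{i,a}}$ that results from averaging each test separately, and the exact expectation is obtained by enumerating all $2^N$ configurations, so it is usable only for a handful of variables.

```python
import itertools

def expected_undetected_zeros(C, p):
    """Exact mean number of undetected zeros, by enumeration over all X."""
    N, M = len(C), len(C[0])
    total = 0.0
    for x in itertools.product((0, 1), repeat=N):
        ones = sum(x)
        weight = p ** ones * (1 - p) ** (N - ones)
        positive = [any(C[j][a] and x[j] for j in range(N)) for a in range(M)]
        # A zero is undetected when every pool containing it is positive.
        undetected = sum(
            1 for i in range(N)
            if x[i] == 0 and all(positive[a] for a in range(M) if C[i][a])
        )
        total += weight * undetected
    return total

def lower_bound_B(C, p):
    """Product-form bound: q * sum_i prod_a (1 - q^(d_a - 1))^(c_ia)."""
    N, M = len(C), len(C[0])
    q = 1 - p
    degree = [sum(C[i][a] for i in range(N)) for a in range(M)]
    total = 0.0
    for i in range(N):
        product = 1.0
        for a in range(M):
            if C[i][a]:
                product *= 1 - q ** (degree[a] - 1)
        total += product
    return q * total
```

On a cycle with three variables and three tests (girth 6) the two quantities coincide, as stated in part (b); on two variables sharing two tests (a cycle of length 4) the exact mean is strictly larger than the bound, reflecting the positive correlation used in part (a).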



3.2 Proof of Lemma 2

Given a choice $(M, C_{N,M})$, we define for each variable i the vector $\vec{m}^i = (m^i_1, \dots, m^i_N) \in (\mathbb{Z}_N^+)^N$, where $m^i_j$ denotes the number of tests which contain variable i and globally contain j variables:

$$m^i_j := \sum_{a=1}^{M} c_{i,a}\, \delta_{j,d_a},$$

where $d_a$ is defined in (27) (and $\delta$ is the Kronecker $\delta$). Then we define for each $\vec{m} = (m_1, \dots, m_N) \in (\mathbb{Z}_N^+)^N$ a density, $f(\vec{m})$, such that $N f(\vec{m})$ is the number of variables i for which $\vec{m}^i = \vec{m}$. With this notation we can rewrite the number of tests, M, and the definition (25) for $B_{M,C_{N,M},p}$ as

$$M = N \sum_{\vec{m}} f(\vec{m}) \sum_{j=1}^{N} \frac{m_j}{j}, \tag{39}$$

$$B_{M,C_{N,M},p} = N q \sum_{\vec{m}} f(\vec{m})\, P(\vec{m}), \tag{40}$$

where here (and whenever it appears in the following) the sum over $\vec{m}$ is performed on $\vec{m} \in (\mathbb{Z}_N^+)^N$ and

$$P(\vec{m}) := \prod_{j=1}^{N} \left(1 - q^{j-1}\right)^{m_j}. \tag{41}$$



Let also 



M 



a=l 



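with this notation and using (39) and (40) it is immediate to check that

$$M + B_{M,C_{N,M},p} = B_p(f_{M,C_{N,M}}), \tag{42}$$

where, for any couple $(M, C_{N,M})$, $f_{M,C_{N,M}} : (\mathbb{Z}_N^+)^N \to (0, 1/N, \dots, N/N)$ is defined by

$$f_{M,C_{N,M}}(\vec{m}) := N^{-1} \sum_{i=1}^{N} \prod_{j=1}^{N} \delta_{m_j, m^i_j} \tag{43}$$

and

$$B_p(f) := N \sum_{\vec{m}} f(\vec{m}) \left[ \sum_{i=1}^{N} \frac{m_i}{i} + q\, 0^{m_1} \exp\Big(-\sum_{i=2}^{N} m_i a_i^p\Big) \right]. \tag{44}$$

Therefore by using (42) and definition (26) we get

$$\min_{M,\, C_{N,M}} \left( M + B_{M,C_{N,M},p} \right) = \min_{M,\, C_{N,M}} B_p(f_{M,C_{N,M}}) \ge \inf_{f \in \mathcal{F}} B_p(f) \ge N \min_{\vec{m}} A_p(\vec{m}) = N A_p, \tag{45}$$

where $\mathcal{F}$ is the set of probability functions on $(\mathbb{Z}_N^+)^N$, i.e. $f : (\mathbb{Z}_N^+)^N \to [0,1]$ with $\sum_{\vec{m}} f(\vec{m}) = 1$. The second inequality immediately follows from the definition (26) and the fact that f is a probability distribution. □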

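The following remark will be used to construct an optimal algorithm in Section 4.

Remark 2: Define $g \in \mathcal{F}$ as

$$g(\vec{m}) := \begin{cases} 1 & \text{if } \vec{m} = \tilde{\vec{m}}, \\ 0 & \text{otherwise,} \end{cases} \tag{46}$$

with $\tilde{\vec{m}} \in (\mathbb{Z}_N^+)^N$ such that

$$\tilde{m}_i := \begin{cases} \big[\,|\log p|/\log 2\,\big] & \text{if } i = [\log 2/p], \\ 0 & \text{otherwise.} \end{cases} \tag{47}$$

Then:

$$\frac{B_p(g)}{Np|\log p|} = \frac{1}{(\log 2)^2} + o(1). \tag{48}$$

Furthermore g coincides with $f_{M,C_{N,M}}$ (see (43)) on all bipartite graphs with $M = N[p|\log p|/(\log 2)^2] = \tilde{M}$ tests and connectivity matrix $C_{N,M}$ such that the number of tests per variable is fixed equal to $[|\log p|/\log 2] = \tilde{L}$.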
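The statement of Remark 2 can be probed numerically. The sketch below is an illustration only: the function name `remark2_ratio` is hypothetical, and it evaluates the per-variable cost $m_K/K + (1-p)\exp(-m_K a_K)$ at the single non-zero entry $m_K = [|\log p|/\log 2]$ with $K = [\log 2/p]$ and $a_K = |\log(1 - (1-p)^{K-1})|$, then divides by $p|\log p|$.

```python
import math

def remark2_ratio(p):
    """Per-variable cost at the regular design of Remark 2, over p|log p|."""
    L = math.floor(abs(math.log(p)) / math.log(2))   # tests per variable
    K = math.floor(math.log(2) / p)                  # variables per test
    a_K = abs(math.log(1 - (1 - p) ** (K - 1)))
    per_variable = L / K + (1 - p) * math.exp(-L * a_K)
    return per_variable / (p * abs(math.log(p)))
```

For small p the ratio stays slightly above $1/(\log 2)^2 \approx 2.08$; the approach to that value is slow because the correction is of relative order $1/|\log p|$ and the integer parts introduce small oscillations.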

3.3 Proof of Lemma 3 and Claim 1

Proof of Lemma 3: Let $\vec{m}$ be the vector over which $A_p$ is reached. We consider separately the two complementary cases: (a) $m_1 \ge 1$ and (b) $m_1 = 0$. In case (a), the minimum is obviously larger than or equal to one. Therefore

$$A_p \ge \min(1, b_p)$$

where

$$b_p := \min_{\vec{m} \in (\mathbb{Z}_N^+)^N,\ m_1 = 0} A_p(\vec{m}). \tag{49}$$

We now enlarge the minimization over $m_i$ to all real non-negative values $\mathbb{R}^+ = [0, \infty)$, and introduce the two functions on $(\mathbb{R}^+)^{N-1}$

$$u(\vec{m}) := \sum_{i=2}^{N} \frac{m_i}{i}, \tag{50}$$

$$v_p(\vec{m}) := \sum_{i=2}^{N} m_i a_i^p. \tag{51}$$

A simple bound on $b_p$ is expressed in terms of these functions:

$$b_p \ge \min_{\vec{m} \in (\mathbb{R}^+)^{N-1}} \left( u(\vec{m}) + (1-p)\, e^{-v_p(\vec{m})} \right). \tag{52}$$

This minimization is carried out in two steps. We first fix $v_p(\vec{m}) = w \ge 0$, and look for the minimum of u in the subspace $v_p(\vec{m}) = w$. Let us denote by $u^*(w,p)$ this minimum value. Finding $u^*$ is a problem of linear optimization, so the minimum must be attained at one of the vertices of the simplex of $(\mathbb{R}^+)^{N-1}$ defined by $v_p(\vec{m}) = w$. These vertices are easily identified: there are $N - 1$ of them, located at points $\vec{m}^{(2)}, \dots, \vec{m}^{(N)}$ with $m^{(r)}_j = \delta_{j,r}\, w/a_r^p$. As $u(\vec{m}^{(r)}) = w/(r a_r^p)$, the minimum of u is $u^*(w,p) = w \min_{r \in \{2, \dots, N\}} 1/(r a_r^p)$. By enlarging the space of r to all real values in $[2, \infty)$, we get:

$$u^*(w,p) \ge w\, U(p). \tag{53}$$

Now we carry out the optimization in (52) as:

$$b_p \ge \min_{w \in [0,\infty)} \left( w\, U(p) + (1-p)\, e^{-w} \right) = c(p), \tag{54}$$

which establishes Lemma 3. □

Proof of Claim 1: Let $z := (1-p)^{r-1}$ and

$$g_p(z) := \frac{(1-z)\log(1-z)}{z \log[z(1-p)]}. \tag{55}$$

For $r \ge 2$ and $0 < p < 1$, it is immediate to verify that $0 < z \le (1-p)$ and that any stationarity point for $U(p)$ must satisfy $g_p(z) = 1$. By studying the function $g_p(z)$ it is then possible to prove that there are two values $z \in (0,1)$ which satisfy the latter condition. Furthermore, when $p \to 0$ only one of these two values belongs to $(0, 1-p]$ and it corresponds to $z = 1/2 - \epsilon(p)$ with $\epsilon(p) = O(p)$. The desired result for $U(p)$ immediately follows. □
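Claim 1 and the expansion of $c(p)$ can be checked numerically with a simple grid search. The sketch below is illustrative and rests on two stated assumptions: the minimisation over $r \in [2, \infty)$ is rewritten in the variable $z = (1-p)^{r-1} \in (0, 1-p]$, so that $1/(r a_r^p) = |\log(1-p)| / \big(\log[z(1-p)] \log(1-z)\big)$, and the inner minimum over w of $w U + (1-p)e^{-w}$ is taken in closed form at $e^{-w} = U/(1-p)$. The function names `U` and `c` mirror the symbols in the text, and the `grid` parameter is a hypothetical resolution setting.

```python
import math

def U(p, grid=20000):
    """Grid approximation of min over r >= 2 of 1 / (r * a_r), via z in (0, 1-p)."""
    best = math.inf
    scale = abs(math.log1p(-p))
    for k in range(1, grid):
        z = (1 - p) * k / grid
        # Both logarithms are negative, so their product is positive.
        best = min(best, scale / (math.log(z * (1 - p)) * math.log1p(-z)))
    return best

def c(p):
    """min over w >= 0 of w*U(p) + (1-p)*exp(-w), using the stationary point."""
    u = U(p)
    if u >= 1 - p:
        return 1 - p          # the minimum is then attained at w = 0
    return u * (1 + math.log((1 - p) / u))
```

At $p = 10^{-3}$, for example, $U(p)(\log 2)^2/p$ is within one percent of 1, and $c(p)$ differs from the first two terms of its small-p expansion by an amount small compared with p.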

4 Upper bounds on T(N,p) for $\beta \in [0, 1/2)$ via regular-regular graphs

In this section we prove Theorem 3. On the one hand this result allows us to complete the proof of Theorem 1, namely to identify the sharp asymptotic value of $T(N,p)/(Np|\log p|)$ in the limit $N \to \infty$, $p = N^{-\beta}$ for $\beta \in [0, 1/2)$. Precisely,

Proof of Theorem 1: The proof follows immediately from Theorem 6 and Theorem 3. □

On the other hand Theorem 3 provides a constructive procedure for a class of algorithms (i.e. a choice of M and a class of matrices $\{C_{N,M}\}$) which are asymptotically optimal.

In order to construct these algorithms we will keep in mind the following observations. First, as already remarked in the previous section, the number of tests due to the undetermined ones is negligible for all the choices of p discussed in Theorem 3. Therefore we focus on algorithms that minimize the number of tests in the first stage, M, plus the number of undetermined zeros, $|U_0|$. The second observation is that inequality (33) comes from (31), and the latter becomes an equality provided M, $C_{N,M}$ are such that the corresponding graph has girth at least 6. The third observation is the one contained in Remark 2, which states that the minimum for the right hand side of (33) is attained on any graph with $\tilde{M}$ tests and $\tilde{L}$ tests per variable, where $\tilde{L}$ and $\tilde{M}$ have been defined in (10) and (11) (we recall that both $\tilde{M}/N$ and $\tilde{L}$ depend only on p). Therefore if it is possible to find at least one graph with girth at least 6 among those with $\tilde{M}$ tests and $\tilde{L}$ tests per variable, the mean number of tests on this graph will match the lower bound in Theorem 6 in the limit $p \to 0$. In the following we will use the above ideas and the results on regular-regular graphs with a fixed minimal girth which have been obtained in [17].

Proof of Theorem 3: Consider a connectivity matrix $\tilde{C}_{N,\tilde{M}}$ with fixed variable degree $\tilde{L}$, fixed test degree $\tilde{K} = \tilde{L}N/\tilde{M}$ and girth at least 6 (see condition (56)). The proof of the existence of such a graph and an explicit procedure for its construction have been provided by Lu and Moura in their study of large girth LDPC codes [17]. Their procedure requires that the condition


M > 



(L — 1)(NL/M) 

Tk-l-W~ 



(56) 



be satisfied. (This condition corresponds to condition (14) in appendix A of [17] for the choice g = 6.) In the limit $N \to \infty$ with $p = N^{-\beta}$ and $\beta < 1/2$, the validity of (56) can be readily checked, using definitions (10) and (11).

From equations (2) and (37), the number of tests on any such graph satisfies the inequality

$$T_{\tilde{M},\tilde{C}_{N,\tilde{M}},p} \le \tilde{M} + Np + N(1-p) R_p \tag{57}$$

where

$$R_p := \sum_X \mu_p(X) \prod_{a=1}^{\tilde{M}} W_{i,a} = \left(1 - (1-p)^{\tilde{K}-1}\right)^{\tilde{L}}. \tag{58}$$

The last equality is obtained by using definition (38) for $W_{i,a}$ (the mean over the Bernoulli distribution is easily performed thanks to the girth condition). Theorem 3 immediately follows from (57). □

The results (57) and (58) hold for any choice of p. However, it is important to notice that the existence of at least one such connectivity matrix with girth at least 6 is guaranteed only for $\beta \in [0, 1/2)$. In particular for $p = N^{-\beta}$ with $1 < \beta < 2$ there cannot exist any such matrix: otherwise (57) and (58) would imply $T(N,p) \le \beta/(\log 2)^2\, N^{1-\beta} \log N$, which goes to zero as $N \to \infty$ (and is in contradiction with the lower bounds in formula (57) of [14]).

Putting together (57), (58) and (36) it is also immediate to verify that the higher order corrections to the optimal value $T(N,p)$ are of order pN. More precisely, if we let

$$H(N,p) := \frac{T(N,p) - Np|\log p|(\log 2)^{-2}}{Np} \tag{59}$$

the following holds.

Remark 3: For $\beta \in [0, 1/2)$, in the limit $N \to \infty$ the following holds

$$\frac{1 - 2|\log \log 2|}{(\log 2)^2} \le H(N,p) \le 2. \tag{60}$$
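To get a feeling for how the bound for regular-regular designs compares with the asymptotic constant at finite size, one can evaluate it directly. The sketch below assumes the bound takes the form $M + Np + N(1-p)\,(1 - (1-p)^{K-1})^{L}$ with $L = [|\log p|/\log 2]$, $M = [Np|\log p|/(\log 2)^2]$ and $K = NL/M$; the function name `regular_regular_bound` is hypothetical, K is not forced to be an integer, and the existence of a girth-6 graph with these parameters is not checked.

```python
import math

def regular_regular_bound(N, p):
    """Upper bound on the mean number of tests, divided by N p |log p|."""
    log_p = abs(math.log(p))
    L = math.floor(log_p / math.log(2))               # tests per variable
    M = math.floor(N * p * log_p / math.log(2) ** 2)  # first-stage tests
    K = N * L / M                                     # variables per test
    R = (1 - (1 - p) ** (K - 1)) ** L                 # zero left undetected
    return (M + N * p + N * (1 - p) * R) / (N * p * log_p)
```

For $N = 10^6$ and $p = 10^{-2}$ the ratio is roughly 2.5, still visibly above $1/(\log 2)^2 \approx 2.08$, which is consistent with corrections of order pN relative to a leading term of order $Np|\log p|$.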



5 Regular-Poisson graphs are (also) optimal for $\beta \in [0, 1/2)$

In this section we will prove Theorem 4, which shows that asymptotically optimal pool designs are obtained with regular-Poisson distributions for proper choices of the graph parameters. This is particularly relevant since the construction of regular-Poisson graphs is much simpler than the construction of [17] for regular-regular graphs with girth at least 6.

Consider the regular-Poisson distribution on bipartite graphs defined in Section 2 with N variable nodes, $\tilde{M}$ test nodes and $\tilde{L}$ tests per variable, $P^{R-P}_{N,\tilde{M},\tilde{L}}$. Fix a variable, i, and let $\mathcal{E}_i^n$ be the characteristic function of the event (defined over the space of all bipartite graphs with $\tilde{M}$ test nodes) that there are more than n loops of length 4 which contain i, i.e. there are more than n triples (j, a, b) with j a variable different from i ($j \neq i$, $j \in (1, \dots, N)$) and a, b two distinct tests ($a \neq b$, $a, b \in (1, \dots, \tilde{M})$) such that i and j belong to both tests. Precisely, we define $\mathcal{E}_i^n : C_{N,\tilde{M}} \in \{0,1\}^{N\tilde{M}} \to \mathbb{R}$ as $\mathcal{E}_i^n(C_{N,\tilde{M}}) := 1$ if

$$\sum_{j=1,\,(j \neq i)}^{N}\ \sum_{1 \le a < b \le \tilde{M}} c_{i,a}\, c_{i,b}\, c_{j,b}\, c_{j,a} > n \tag{61}$$

and $\mathcal{E}_i^n(C_{N,\tilde{M}}) := 0$ otherwise.

In order to prove Theorem 4 we will need the following Lemmas, which give an upper bound on the probability that there are more than n loops of length 4 through i (Lemma 4) and an upper bound on the probability that i is an undetermined zero and does not belong to more than n loops of length 4 (Lemma 5).

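Lemma 4:

$$\sum_{C_{N,\tilde{M}}} P^{R-P}_{N,\tilde{M},\tilde{L}}(C_{N,\tilde{M}})\, \mathcal{E}_i^n(C_{N,\tilde{M}}) \le \frac{N\tilde{L}^6}{\tilde{M}^3} + \left( \frac{N\tilde{L}^4}{\tilde{M}^2} \right)^{n+1}. \tag{62}$$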



Lemma 5: Let k be the average degree of the checks, $k := N\tilde{L}/\tilde{M} = \log 2/p$, and



M 



X 



a=l 



(63) 

(we drop for simplicity of notation the dependence of $W_{i,a}$ on X, $C_{N,\tilde{M}}$ and the dependence of $\mathcal{E}_i^n$ on $C_{N,\tilde{M}}$). Define also $\gamma := p^{\alpha}$.

For any n and $\alpha$ with $0 \le n < \tilde{L}/2$ and $0 < \alpha < 1/2$, the following holds



+Lexp 



(l_p)*(i+7) 
7 2 log 2 



L-2n 



2p 




(64) 



7 2 log 2 
2p 



Proof of Theorem 4:
For any n and $\alpha$ with $0 \le n < \tilde{L}/2$ and $0 < \alpha < 1/2$, the mean number of tests verifies

E P^O^Tjg^ <M + N P+ 




2\ "+ 1 



Np 3 
+NLexp 



N 



+ No L exp 



;i _ p )fe(i+7) 
7 log 2 



L-2n 



2p 



(65) 



where $\gamma := p^{\alpha}$.

In order to derive (65) we have: (i) used definition (2); (ii) bounded the mean number of undetermined ones with the mean number of ones; (iii) decomposed the mean number of undetermined zeros into those that are (are not) on a variable node i which is contained in more than n loops of length 4; (iv) upper bounded the last two terms via the results of Lemma 4 and Lemma 5. If we now make the choice $n := \tilde{L}^{1/2}$ the result of the Theorem immediately follows by noticing that:

Um iVXexp(-7V 1 log2/2) = Q 



7V^oo|/3 



lim 



Np\ logp| 
logp| 3 







N^oo\/3 Np 3 Np\ hgp\ 

N ( logp \ 



lim 

N-*oo\/3 NplOgp \ Np 2 



(66) 
(67) 

(68) 

□ 

Proof of Lemma 4: Given a connectivity matrix, $C_{N,\tilde{M}}$, we identify among the loops of length 4 two distinct classes: loops of type S and of type D. Loops of type S are those disconnected from any other loop, namely they correspond to the choices of two variables i, j and two tests a, b such that both i and j belong to a and b and there does not exist another test containing both i and j. Loops of type D are all loops of length 4 which are not of type S. For a given variable i, let $\mathcal{D}_i$ be the characteristic function of the event that i belongs to at least one loop of type D. Precisely, we define $\mathcal{D}_i : C_{N,\tilde{M}} \in \{0,1\}^{N\tilde{M}} \to \mathbb{R}$ as $\mathcal{D}_i(C_{N,\tilde{M}}) := 1$ if

$$\sum_{j=1,\,(j \neq i)}^{N}\ \sum_{1 \le a < b < c \le \tilde{M}} c_{i,a}\, c_{i,b}\, c_{i,c}\, c_{j,b}\, c_{j,a}\, c_{j,c} > 0 \tag{69}$$

and $\mathcal{D}_i(C_{N,\tilde{M}}) := 0$ otherwise.

The following inequalities hold

$$\sum_{C_{N,\tilde{M}}} P^{R-P}_{N,\tilde{M},\tilde{L}}(C_{N,\tilde{M}})\, \mathcal{D}_i \le \frac{N\tilde{L}^6}{\tilde{M}^3}, \tag{70}$$

$$\sum_{C_{N,\tilde{M}}} P^{R-P}_{N,\tilde{M},\tilde{L}}(C_{N,\tilde{M}})\, \mathcal{E}_i^n (1 - \mathcal{D}_i) \le \left( \frac{N\tilde{L}^4}{\tilde{M}^2} \right)^{n+1} \tag{71}$$

(we drop the dependence of $\mathcal{D}_i$ and $\mathcal{E}_i^n$ on $C_{N,\tilde{M}}$). The result follows immediately from (70) and (71) and the fact that, for any z and

Proof of Lemma 5: For a given connectivity matrix $C_{N,\tilde{M}}$, let i be a site with fewer than n loops of length 4. Let us call $A_i(C_{N,\tilde{M}})$ the set of tests which contain i, $B_i(C_{N,\tilde{M}})$ the set of tests which belong to a loop of length 4 passing through i, and $\bar{B}_i(C_{N,\tilde{M}}) = A_i(C_{N,\tilde{M}}) \setminus B_i(C_{N,\tilde{M}})$. Clearly $|B_i(C_{N,\tilde{M}})| \le 2n$. The following holds



M 

ww i>a < _n W i>a 

«=1 063,(0^) 



(72) 



/ 



_n 



n (i 



3=1,. ..AT 



(73) 



We can now plug (|72[) into (|63|) and get 



C p < P NTIT( C N,}vl) 



N.M 



HmC N)JI )\>L-2n) _n [1 



(74) 

a-pr- 1 ] 



where in order to perform the mean over $\mu_p(X)$ we used the fact that the neighborhoods of any two tests a, b belonging to $\bar{B}_i(C_{N,\tilde{M}})$ intersect only in i (for any $j \neq i$ one has $c_{j,a} c_{j,b} = 0$) and we recall from definition (27) that $d_a$ is the degree of test a. Let $k_{max}$ be the maximum degree of the first $\tilde{L}$ tests, namely $k_{max} := \max_{a \in (1, \dots, \tilde{L})} \sum_j c_{j,a}$. Using (74) and the invariance of the regular-Poisson distribution under test permutations, we get



N 



C v< E P n£l( C n,m) E *M-(1 " (1 " ?)*) 

fc=0 



k\L-2n 



C 

JV,M 



k(l+~f)\L-2n 




(75) 



(76) 

L-2n 



It is now easy to verify that in the limit p — > 0: 

G p < exp[-7V 1 log2/2]+o(exp[- 7 V 1 log2/2]) 

(77) 

and by plugging (77) into (75) the proof is completed. □

6 Upper bounds on T(N,p) for $\beta \in [1/2, 1)$ via Poisson-Poisson graphs

In this section we prove Theorem 5. This allows us to complete the proof of Theorem 2, which establishes upper and lower bounds on $T(N,p)/(Np|\log p|)$ when $p = 1/N^{\beta}$ and $1/2 \le \beta < 1$.

Proof of Theorem 2: The proof follows immediately from Theorem 6 and Theorem 5. □

Furthermore Theorem 5 allows us to identify a class of algorithms over which the upper bound is attained.

Proof of Theorem 5: Consider the class of Poisson-Poisson distributions on bipartite graphs defined in (19) with N variable nodes, M test nodes, a mean number of tests per variable equal to L and a mean number of variables per test equal to $K = NL/M$, $P^{P-P}_{N,M,L}$. From (37), performing first the average with respect to the Poisson-Poisson distribution in which the $c_{i,a}$ variables are iid, the mean number of undetected zeros can be written as:

$$\sum_{C_{N,M}} P^{P-P}_{N,M,L}(C_{N,M}) \sum_X \mu_p(X)\, |U_0| = \sum_{i=1}^{N} \sum_X \mu_p(X)\, (1 - x_i) \left[ 1 - \frac{K}{N} \prod_{j \neq i} \left( 1 - \frac{K}{N} x_j \right) \right]^{M}. \tag{78}$$

Denoting by r the number of indices j such that $x_j = 1$, this gives:

$$\sum_{C_{N,M}} P^{P-P}_{N,M,L}(C_{N,M}) \sum_X \mu_p(X)\, |U_0| = N \sum_{r=0}^{N-1} \binom{N-1}{r} p^r q^{N-r} \left[ 1 - \frac{K}{N} \left( 1 - \frac{K}{N} \right)^{r} \right]^{M}, \tag{79}$$

where we recall that $q := (1 - p)$.
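The average over the Poisson-Poisson ensemble can be evaluated numerically at finite N. The sketch below is an illustration under stated assumptions: each variable is taken to belong to each test independently with probability $s = L/M$, so that a zero variable with r defectives among the other $N-1$ variables stays undetected with probability $[1 - s(1-s)^r]^M$; the undetermined ones are bounded by Np; M is set to $[eNp|\log p|]$ while L is left unrounded at $e|\log p|$. The function name `poisson_poisson_ratio` is hypothetical.

```python
import math

def poisson_poisson_ratio(N, p):
    """Bound on the ensemble-averaged number of tests, divided by N p |log p|."""
    log_p = abs(math.log(p))
    M = math.floor(math.e * N * p * log_p)
    s = math.e * log_p / M          # membership probability L / M
    q = 1 - p
    undetected = 0.0
    for r in range(N):
        # log of binom(N-1, r) * p^r * q^(N-1-r), computed via lgamma to avoid overflow
        log_w = (math.lgamma(N) - math.lgamma(r + 1) - math.lgamma(N - r)
                 + r * math.log(p) + (N - 1 - r) * math.log(q))
        w = math.exp(log_w)
        if w == 0.0:
            continue                # negligible weight, skip the power below
        undetected += w * (1 - s * (1 - s) ** r) ** M
    expected_tests = M + N * p + N * q * undetected
    return expected_tests / (N * p * log_p)
```

For $N = 10^5$ and $p = 10^{-3}$ the ratio comes out close to 3, above the limiting value e: the first-stage tests alone contribute almost exactly e, and the remaining terms are of relative order $1/|\log p|$, enhanced at finite N by fluctuations in the number of defectives.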
Let 7 := p/\ \ogp\, then 



N-l 



r=N{p+- 1 ) \ / 

< exp[-iV7 W2] + o(exp[-iV7V 1 /2]) 



(80) 



By using definition (0 and the above equations 



([781) . © and (80]> we get 



N^oo\f3 



Tm,c n , m ,p 
Np\ \ogp\ 



< 



lim (pllogpl) 1 

N^oo\f3 



^+ri-^n-^W A 



+iV~ 1 exp(-3Np\ logp|~ 2 / 2 ) + P 
M 



lim 

N^oo\/3 



Np\ \ogp 



+ 



1 - £(l 

1 N v 1 



I£\Np+Nf 
N> 



M 



p\ \ogp\ 



(81) 



By minimizing the last expression over M and K we find that the optimal value is taken at $M = epN|\log p| + o(Np|\log p|) = \hat{M} + o(\hat{M})$ and $K = 1/p + o(1/p) = N\hat{L}/\hat{M} + o(N\hat{L}/\hat{M})$, where $\hat{L}$ and $\hat{M}$ have been defined in (20) and (21).
Furthermore 



^• M> JVp|logp| 



is easily verified, thus completing the proof of Theorem 2. □

Remark 1 can be proven along the same lines.